Vectorisation #589

Closed. sv2518 wants to merge 17 commits.

Conversation

@sv2518 (Contributor) commented Jun 27, 2020

I would like to merge TJ's changes to enable vectorisation of TSFC kernels.

This branch depends on https://github.com/firedrakeproject/loopy/tree/cvec.

#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.

@sv2518 requested review from tj-sun and wence- on June 27, 2020 18:37
@tj-sun (Contributor) left a comment

I'm quite out of sync with the changes made to PyOP2 and Firedrake over the last 18 months or so, so I'm probably not qualified to review this. However, I did take a look and added some comments.

I remember that this (or a similar) branch did pass all tests at one point in time (I can't quite recall whether that was locally on my Linux machine or on Jenkins), but it is very likely that subsequent changes and/or new examples introduce new issues.

One major feature that I didn't touch at all back then is detecting the hardware/compiler environment and applying vectorisation accordingly. This is especially important if you want to switch on vectorisation by default. As a first step to get things going, I guess we could have users opt in rather than turning this on by default, but in that case we should at least have some Jenkins jobs that test vectorisation.
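
For the opt-in route, a rough sketch of gating the transformation behind a configuration flag might look as follows (the "vectorize" key and apply_vectorisation are illustrative names, not things defined in this PR):

from pyop2.configuration import configuration

def maybe_vectorise(wrapper_kernel):
    # Sketch: only apply the vectorisation transform when the user has
    # explicitly opted in; otherwise leave the wrapper kernel untouched.
    if not configuration.get("vectorize", False):
        return wrapper_kernel
    return apply_vectorisation(wrapper_kernel)  # hypothetical transform entry point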

Inline review comments (now resolved, mostly outdated) on pyop2/sequential.py, pyop2/configuration.py, pyop2/codegen/rep2loopy.py and pyop2/compilation.py.
@sv2518 (Contributor, Author) commented Jul 1, 2020

The Firedrake test test_real_space.py::test_real_interpolate is failing with the following generated code, and the result is 4 times what we expect. It seems like there is another bug in the splitting. I think the iname should not be split at all, since it only has extent 1.

---------------------------------------------------------------------------
KERNEL: wrap_expression_kernel
---------------------------------------------------------------------------
ARGUMENTS:
end: ValueArg, type: np:dtype('int32')
glob0: type: np:dtype('float64'), shape: (1), dim_tags: (N0:stride:1) aspace: global
glob1: type: np:dtype('float64'), shape: (1), dim_tags: (N0:stride:1) aspace: global
start: ValueArg, type: np:dtype('int32')
---------------------------------------------------------------------------
DOMAINS:
[end, start] -> { [n_outer, n_batch] : n_batch >= start - 4n_outer and 0 <= n_batch <= 3 and n_batch < end - 4n_outer }
{ [expr_p0] : expr_p0 = 0 }
---------------------------------------------------------------------------
INAME IMPLEMENTATION TAGS:
expr_p0: None
n_batch: c_vec
n_outer: None
---------------------------------------------------------------------------
TEMPORARIES:
_zeros: type: np:dtype('float64'), shape: () scope:global
---------------------------------------------------------------------------
INSTRUCTIONS:
    for n_outer, n_batch
↱       glob0[0] = 0  {id=statement1}
├↱      ... nop  {id=expr__start}
││      for expr_p0
└└↱       glob0[expr_p0] = glob0[expr_p0] + glob1[0]  {id=expr_insn}
  │     end expr_p0
  └     ... nop  {id=statement0}
    end n_outer, n_batch
---------------------------------------------------------------------------

@wence- (Member) commented Jul 1, 2020

Can you dump the wrapper kernel before this splitting transform so we can have a look at what went wrong? Do loopy.dump_as_python(kernel) to get something that can be used to construct it independently.

@sv2518 (Contributor, Author) commented Jul 1, 2020

import loopy as lp
import numpy as np

knl = lp.make_kernel(
    [
    "[end, start] -> { [n] : start <= n < end }",
    "{ [expr_p0] : expr_p0 = 0 }",
    ],
    """
    glob0[0] = 0 {id=statement1, inames=n}
    ... nop {id=expr__start, dep=statement1, inames=n}
    glob0[expr_p0] = glob0[expr_p0] + glob1[0] {id=expr_insn, dep=expr__start:statement1, inames=n:expr_p0}
    ... nop {id=statement0, dep=expr_insn, inames=n}
    """, [
        lp.ValueArg(
            name='start', dtype=np.int32),
        lp.ValueArg(
            name='end', dtype=np.int32),
        lp.GlobalArg(
            name='glob0', dtype=np.float64,
            shape=(1,), for_atomic=False),
        lp.GlobalArg(
            name='glob1', dtype=np.float64,
            shape=(1,), for_atomic=False),
        lp.TemporaryVariable(
            name='_zeros', dtype=np.float64,
            shape=(), for_atomic=False,
            address_space=lp.AddressSpace.GLOBAL,
            read_only=True,
            initializer=np.array(0.),
            ),
        ], lang_version=(2018, 2))
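
For reference, the problematic transformation can then be reproduced from this dump by splitting the n iname the way the wrapper does and printing the generated code (a sketch; the c_vec tag needs the firedrakeproject/loopy cvec branch this PR depends on):

# Split the iname as the PyOP2 wrapper does, then inspect the generated C.
knl = lp.split_iname(knl, "n", 4, inner_tag="c_vec")
print(lp.generate_code_v2(knl).device_code())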

@wence- (Member) commented Jul 1, 2020

The loop splitting code doesn't seem to privatise the reduction variables properly for global reductions.

We start with (I've assumed start = 0, end mod 4 = 0 and end > start to remove some conditionals):

void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n = 0; n <= -1 + end; ++n) {
    glob0[0] = 0.0;
    {
      int32_t const expr_p0 = 0;

      glob0[0] = glob0[0] + glob1[0];
    }
  }
}

So far, so good.

If I apply lp.split_iname(knl, "n", 4, inner_tag="c_vec"), then after code generation I get:

void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n_outer = 0; n_outer <= (-4 + end) / 4; ++n_outer) {
    for (int32_t n_inner = 0; n_inner <= 3; ++n_inner) {
      glob0[0] = 0.0;
    }
    {
      int32_t const expr_p0 = 0;

      for (int32_t n_inner = 0; n_inner <= 3; ++n_inner)
        glob0[0] = glob0[0] + glob1[0];
    }
  }
}

Notice that this is wrong: we zero glob0[0] four times and then sum into it four times, rather than zeroing and summing together on every loop iteration. This occurs, I think, because loopy doesn't know that the zeroing of glob0 really belongs in the same loop nest as the summation (the former is created by the PyOP2 wrapper generation and the latter by TSFC, so they have different effective inames).

I wonder what one should have done.
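
For comparison, here is a Python sketch (under the same assumptions: start = 0, end a multiple of 4) of what a correctly privatised version would compute; the acc temporaries are hypothetical, and this is not what loopy currently emits:

def privatised(start, end, glob1):
    # start = 0 and end % 4 == 0 assumed, as in the C listings above.
    glob0 = 0.0
    for n_outer in range(end // 4):
        acc = [0.0] * 4               # per-lane private accumulators
        for n_inner in range(4):      # zeroing and summing stay paired per iteration
            acc[n_inner] = acc[n_inner] + glob1
        for n_inner in range(4):
            glob0 = acc[n_inner]      # write-back reproduces the unsplit result
    return glob0                      # equals glob1 for end > start, as in the original loop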

@sv2518 (Contributor, Author) commented Jul 1, 2020

I don't know. Which instruction belongs to which loop is usually defined via within_inames. Maybe we forgot to specify that somewhere, but since the instruction is in the loop body, I assume that is not it. Maybe the fact that it's wrapped in a Block is the problem.

@wence- (Member) commented Jul 1, 2020

I don't know. Which instruction belongs to which loop is usually defined via within_inames. Maybe we forgot to specify that somewhere, but since the instruction is in the loop body, I assume that is not it. Maybe the fact that it's wrapped in a Block is the problem.

The problem here is that the expr_p0 iname comes from tsfc, whereas the glob0[0] = 0 assignment comes from the wrapper. If they had the same iname it would be OK, but we can't make that happen. I reported a loopy issue here: https://gitlab.tiker.net/inducer/loopy/-/issues/217

@wence- (Member) commented Jul 1, 2020

A workaround is to add yet another condition to the check that switches vectorisation off again; see https://github.com/OP2/PyOP2/pull/589/files#diff-c87324f585685c592b11a1e78337d775R183

@sv2518 (Contributor, Author) commented Jul 1, 2020

Do you expect this to occur only when writing into globals? I added the following condition and was hoping the other failing test would pass as well, but no:

has_wg = any(isinstance(arg.data, Global) and arg.access == WRITE for arg in self._args)
if isinstance(self._kernel.code, loopy.LoopKernel) and not (has_matrix or has_rw or has_wg):

@wence- (Member) commented Jul 1, 2020

No clue, but I'm not sure that it should only trigger when the arg is WRITE; I suspect it should trigger whenever the arg is not READ.
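
For instance, the condition from the earlier snippet might become (a sketch, reusing those names):

has_wg = any(isinstance(arg.data, Global) and arg.access != READ for arg in self._args)
if isinstance(self._kernel.code, loopy.LoopKernel) and not (has_matrix or has_rw or has_wg):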

@wence- (Member) commented Jul 1, 2020

None of these workarounds is a very satisfactory solution: it feels like the implementations of the cvec and omp_simd backends in loopy are not very complete. But I guess we should get as much code merged as possible and then fix that.

@sv2518 (Contributor, Author) commented Jul 1, 2020

Yes to both comments.

The line smoother test is still failing, but only the one for the periodic hierarchy. The solver diverges at the end. I have no idea how to debug this.

@wence- (Member) commented Jul 1, 2020

The line smoother test is still failing, but only the one for the periodic hierarchy. The solver diverges at the end. I have no idea how to debug this.

  1. Check that the assembly of the operator is correct.
  2. Check the residual (see the sketch after this list).
  3. Can you get the failure on a mesh with a single column? (It will take a different number of iterations from what the test asserts.)
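
For point 2, a minimal residual check could look like this (a sketch; a, L and uh stand for whatever bilinear form, right-hand side and computed solution the failing test uses):

from firedrake import assemble, action

def residual_norm(a, L, uh):
    # l2 norm of the algebraic residual of a(uh, .) - L for bilinear form a,
    # linear form L and computed solution uh.
    r = assemble(action(a, uh) - L)  # assembled 1-form holding the residual
    return r.dat.norm                # PyOP2 Dat exposes the vector 2-norm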

@wence- (Member) commented Jul 2, 2020

A workaround is to add yet another condition to the check that switches vectorisation off again; see

Here's a proposed better fix. The loopy folks suggest privatising the reduction via a temporary variable (see the discussion here: https://gitlab.tiker.net/inducer/loopy/-/issues/217).

We can do this by producing code with temporaries at codegen time. I pushed 0677a24, which does this and hopefully fixes the problem (without introducing new ones).

@wence- (Member) commented Jul 2, 2020

OK, that was a bad fix; I will revert it for now.

@sv2518 (Contributor, Author) commented Jul 3, 2020

What went wrong with your fix, @wence-?

@wence- (Member) commented Jul 3, 2020

Oh, it revealed firedrakeproject/firedrake#1768.

@sv2518 force-pushed the vectorisation branch 3 times, most recently from 3d47231 to fdf3fad on July 6, 2020 10:32
@sv2518 (Contributor, Author) commented Jul 6, 2020

The kernel that causes the failure in tests/regression/test_line_smoother_periodic.py::test_line_smoother_periodic is attached in vectorised and unvectorised form. The bug is in the value of n_outer in the initial slab.

Archive.zip

@wence- (Member) commented Jul 7, 2020

Can you show what values the start and end parameters have that result in bad values for n_outer in the initial slab?

Translating that loop structure to Python, I get:

def single(start, end):
    for i in range(start, end):
        print(i)


def fdivpos(a, b):
    if a < 0:
        a = a - (b - 1)
    return a // b


def unroll(start, end):
    n_outer = start - ((3 + 3*start) // 4)
    print("initial")
    if end - start - 1 >= 0:
        for b in range(start - 4*n_outer, 4 if end - 4*(n_outer+1) >= 0 else end - 4*n_outer):
            print(4*n_outer + b)
    print("bulk")
    for n in range(1 + start - ((3 + 3*start)//4), (3+end)//4 - 1):
        for b in range(4):
            print(4*n + b)

    print("final")
    n_outer = fdivpos(end - 1, 4)
    if 4*n_outer - start - 1 >= 0:
        for b in range(0, end - 4*n_outer):
            print(4*n_outer + b)

It's not obvious that this gives wrong iteration counts.

@sv2518 (Contributor, Author) commented Jul 7, 2020

I think the reason you cannot see the problem in the iteration counts in your script is that the slab n_outer in the C kernel is rounded differently. To reproduce the problem you would need to replace n_outer = start - ((3 + 3*start) // 4) with math.floor(start - ((3 + 3*start) / 4)). In that case the iterations over the initial slab go missing.
The start and end sets in the test are: (0, 18), (18, 18), (0, 72), (72, 72).
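
For concreteness, with start = 18 the two roundings differ (plain Python, just the arithmetic):

import math

start = 18
print(start - (3 + 3 * start) // 4)             # 18 - 57 // 4 = 18 - 14 = 4
print(math.floor(start - (3 + 3 * start) / 4))  # floor(18 - 14.25) = 3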

@sv2518 (Contributor, Author) commented Jul 7, 2020

So I think in the C kernel we want int const n_outer = start - loopy_floor_div_pos_b_int32((3 + 3 * start), 4);

@wence- (Member) commented Jul 8, 2020

I think the reason you cannot see the problem in the iteration counts in your script is that the slab n_outer in the C kernel is rounded differently. To reproduce the problem you would need to replace n_outer = start - ((3 + 3*start) // 4) with math.floor(start - ((3 + 3*start) / 4)). In that case the iterations over the initial slab go missing.
The start and end sets in the test are: (0, 18), (18, 18), (0, 72), (72, 72).

I'm pretty sure Python and C ints have the same rounding behaviour.

I did a bit more digging, and it turns out this is not a bug in the loopy codegen, but rather in the values that we're sending into the kernel.

You can reproduce it by comparing the output of this code with vectorisation off and on:

from firedrake import *


def periodise(m):
    coord_fs = VectorFunctionSpace(m, "DG", 1, dim=2)
    old_coordinates = m.coordinates
    new_coordinates = Function(coord_fs)
    domain = "{[i, j]: 0 <= i < old_coords.dofs and 0 <= j < new_coords.dofs}"
    instructions = """
    <float64> Y = 0
    <float64> pi = 3.141592653589793
    for i
        Y = Y + old_coords[i, 1]
    end
    for j
        new_coords[j, 0] = atan2(old_coords[j, 1], old_coords[j, 0]) / (pi* 2)
        new_coords[j, 0] = new_coords[j, 0] + (1 if new_coords[j, 0] < 0 else 0)
        new_coords[j, 0] = 1 if (new_coords[j, 0] == 0 and Y < 0) else new_coords[j, 0]
        new_coords[j, 0] = new_coords[j, 0] * Lx[0]
        new_coords[j, 1] = old_coords[j, 2] * Ly[0]
    end
    """
    cLx = Constant(1)
    cLy = Constant(1)
    par_loop((domain, instructions), dx,
             {"new_coords": (new_coordinates, WRITE),
              "old_coords": (old_coordinates, READ),
              "Lx": (cLx, READ),
              "Ly": (cLy, READ)},
             is_loopy_kernel=True)
    return Mesh(new_coordinates, reorder=False)

H = 1
N = 3

mesh = periodise(CylinderMesh(N, N, 1.0, H, reorder=False))

V = FunctionSpace(mesh, "P", 1)
x, _ = SpatialCoordinate(mesh)
xc = interpolate(x, V)

print(xc.dat.data_ro)

# "working"
=> [0.66666667 1.         0.66666667 0.33333333 0.33333333 0.
 0.         0.66666667 0.33333333 0.66666667 0.33333333 0.        ]

# "broken"
=> [0.66666667 1.         0.66666667 0.33333333 0.33333333 1.
 0.         0.66666667 0.33333333 0.66666667 0.33333333 0.        ]

What's happening here is that we're interpolating a DG field into a CG space, and the values we get are dependent on the order in which we execute things. Now, PyOP2 has never guaranteed that it won't change the iteration order, so arguably this code has been relying on unspecified behaviour.

Actually fixing this properly needs a big rethink of the way the multigrid grid transfer is done. However, I can at least get the test to pass by applying this patch in loopy:

diff --git a/loopy/codegen/loop.py b/loopy/codegen/loop.py
index c71554a4..43bff2c3 100644
--- a/loopy/codegen/loop.py
+++ b/loopy/codegen/loop.py
@@ -105,9 +105,9 @@ def get_slab_decomposition(kernel, iname):
         if upper_bulk_bound is not None:
             bulk_slab = bulk_slab.add_constraint(upper_bulk_bound)
 
-        slabs.append(("bulk", bulk_slab))
         if lower_slab:
             slabs.append(lower_slab)
+        slabs.append(("bulk", bulk_slab))
         if upper_slab:
             slabs.append(upper_slab)

@sv2518 (Contributor, Author) commented Jul 8, 2020

If I execute that test, I get different output:

"broken"
[0.         0.33333333 0.         0.33333333 0.66666667 0.66666667
 1.         0.33333333 0.66666667 1.         0.33333333 0.66666667]
"working, including your temporary fix":
[1.         0.33333333 1.         0.33333333 0.66666667 0.66666667
 1.         0.33333333 0.66666667 1.         0.33333333 0.66666667]

@sv2518 force-pushed the vectorisation branch 3 times, most recently from 9d6e625 to 6cda3c1 on July 20, 2020 12:26
@sv2518 (Contributor, Author) commented Jul 21, 2020

The last failing Firedrake test is now tests/extrusion/test_real_tensorproduct.py.

@sv2518 (Contributor, Author) commented Jul 22, 2020

#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.

@wence- (Member) commented Jul 22, 2020

#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.

Done.

@sv2518 force-pushed the vectorisation branch 2 times, most recently from 6420e9e to 856b6aa on July 22, 2020 15:16
@sv2518 marked this pull request as ready for review on July 22, 2020 15:17
@sv2518 (Contributor, Author) commented Jul 22, 2020

I have py-cpuinfo as a required dependency. I added it in Firedrake (https://github.com/firedrakeproject/firedrake/compare/trigger-tests-for-vec) and also in PyOP2 in commit 1cf7698. Do I need both, or is adding it in PyOP2 enough?
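
As background, the kind of hardware check py-cpuinfo enables looks roughly like this (a sketch; the list of instruction sets is illustrative):

import cpuinfo  # provided by the py-cpuinfo package

def supported_simd_extensions():
    # SIMD instruction sets advertised by the host CPU, most capable first.
    flags = cpuinfo.get_cpu_info().get("flags", [])
    return [isa for isa in ("avx512f", "avx2", "avx", "sse4_2") if isa in flags]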

@sv2518 (Contributor, Author) commented Jul 7, 2022

Superseded by #654

@sv2518 closed this on Jul 7, 2022